[Feature Request] Support Dynamic Slot Allocation in RawBlock Backend · Issue #3392 · LMCache/LMCache
Label
new feature
Is your feature request related to a problem? Please describe.
Summary
Introduce a dynamic slot allocation strategy for RawBlockCore to replace the fixed-slot allocator, minimizing internal fragmentation.
Motivation
The fixed-slot allocation model introduces internal fragmentation in several common scenarios.
The most common sources of this issue are:
- Partial (unfull) chunks
- Compression-enabled workloads
In all cases, payload sizes are often smaller than the configured slot size, leaving unused space within allocated slots and reducing effective storage utilization.
1. Partial (Unfull) Chunk Allocation
Workloads often produce payloads smaller than the configured slot size.
This occurs in scenarios such as:
- Naturally incomplete (unfull) chunks when
save_unfull_chunk=True - Prefill-Decode (PD) mode, where
save_unfull_chunkis always enabled and uneven payload sizes are generated by design
Example
Fixed Slot (512 KiB)
[########################################################] 512 KiB slot
[##########..............................................] 50 KiB payload
[##########..............................................] 462 KiB unused space
Result:
- A large portion of the slot remains unused
- Fragmentation accumulates across workloads
2. Compression-Enabled Workloads
When compression is applied, the physical payload becomes smaller than the fixed slot size.
This is especially prominent in architectures such as DeepSeek-V4, where:
- Different layer groups use different compression ratios
- Physical chunk sizes vary across layers
Example
Slot (512 KiB)
Layer A (low compression)
[####################################....................]
Layer B (high compression)
[############............................................]
Layer C (very high compression)
[######..................................................]
Result:
- Highly compressed layers leave large portions of slots unused
- Fragmentation becomes uneven across layers
- Storage efficiency degrades with layer heterogeneity
Please correct me if I'm wrong
Describe the solution you'd like
A clear and concise description of what you want to happen.
Proposed Change
Replace the fixed-slot allocator with a dynamic slot allocator:
- Allocate slot size based on actual payload size
- Continue to use a bump-pointer allocation strategy for fast sequential allocation when capacity is available
- Use power-of-two size classes to bound allocation granularity and ensure predictable reuse patterns
- Maintain a segmented free list for efficient reuse of freed regions
- Provide size-aware allocation hints to avoid allocation–eviction mismatch
This design minimizes internal fragmentation and significantly improves effective storage utilization across heterogeneous workloads.
Expected Benefits
Reduced Fragmentation
Dynamic Allocation
[##########] 50 KiB slot
[##############] 80 KiB slot
[########################] 200 KiB slot
- Minimal unused space per allocation
- Improved storage utilization
Improved Efficiency
- Increased effective storage capacity
- Better reuse of freed regions
- Reduced allocation failures
- Eliminates unnecessary padding overhead from fixed-size allocation
Comments
daegyu94 · 2026-05-26
@DongDongJu
I’d appreciate your thoughts on this approach.
If this direction aligns with the overall design, I’d be happy to proceed with an implementation.
DongDongJu · 2026-05-26
This is always the problem of having fixed memory granularity.
I think we can have a linux OS + io_uring like behavior.
How about having fixed pre-allocated memory pool with different sizes like buddy allocator?
We can reuse it and even we can pin the memory to gpu side as well for improving the read performance.
WDYT?
daegyu94 · 2026-05-27
This is always the problem of having fixed memory granularity. I think we can have a linux OS + io_uring like behavior. How about having fixed pre-allocated memory pool with different sizes like buddy allocator? We can reuse it and even we can pin the memory to gpu side as well for improving the read performance. WDYT?
What I intended was slot management in a raw‑block backend, rather than the fixed buffer registration used in io_uring.
I agree that fragmentation is a classic issue in fixed memory management.
My idea is to keep the io_uring buffers fixed-size, while making the raw‑block space management dynamic instead of fixed-size.
I was also wondering whether the io_uring buffer size needs to match the raw-block allocation unit, since read/write operations can specify the actual length.If I’m misunderstanding something here, I’d really appreciate your feedback.
DongDongJu · 2026-05-27
Yeah, exactly. What I meant was not that every allocation should be fully dynamic from raw memory.
I was thinking more like linux slab/slob/slub style allocation on top of a buddy-like memory pool:
we pre-allocate multiple fixed-size buffer classes, and then reuse them based on the requested payload size.So instead of having only one fixed slot size, we can have several fixed-size pools, e.g. 64KB / 128KB / 256KB / 512KB, and allocate from the smallest class that can fit the payload.
This keeps allocation/reuse simple and predictable, while reducing the internal fragmentation caused by a single fixed slot size.
For io_uring, I think the registered buffers can still be fixed/pre-allocated. The raw-block allocator does not necessarily need to expose the same fixed size as the io_uring buffer, since read/write can still use the actual payload length.
And as you mentioned, if we manage these fixed-size pools carefully, we may also be able to pin/reuse them on the GPU side to improve read performance.
So maybe the design is:
- pre-allocated memory pools with several fixed size classes
- free list per size class
- allocate from the best-fit size class
- reuse freed buffers
- optionally pin some pools for GPU-side read optimization
This is closer to what I intended than arbitrary variable-size allocation.
WDYT?
daegyu94 · 2026-05-28
Your design looks solid.
From my understanding, the slot allocator can be structured as:
- Full chunks: fixed, model-dependent slot size
- Partial chunks: size-class pools for smaller payloads
Based on this, we can consider several design choices.
Below are my thoughts:
- Partial vs Full ratio tuning
- For example: Partial 20%, Full 80%
- The ratio should be configurable and tunable
- Size classes in Partial
- How many size classes should we have?
- What should the range be? (e.g., 64 KB ~ 1024 KB)
- Allocation fallback
- Should we allow fallback from Partial to Full? If so, we need to protect Full capacity (e.g., via reservation)
- Full → Partial fallback should be avoided due to the complexity of splitting and merging partial classes
- Note: These choices will also impact eviction policy, especially under allocation pressure or failure scenarios
I opened this issue because I thought it would be worth discussing how we should handle partial chunks.
I’m also happy to discuss this offline if that would be easier.